23 research outputs found
Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures
For reasons of both performance and energy efficiency, high-performance
computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL
framework supports portable programming across a wide range of computing
devices and is gaining influence in programming next-generation accelerators.
Characterizing the performance of these devices across a range of applications
requires a diverse, portable and configurable benchmark suite, and OpenCL is an
attractive programming model for this purpose. We present an extended and
enhanced version of the OpenDwarfs OpenCL benchmark suite, with a strong focus
placed on the robustness of applications, curation of additional benchmarks
with an increased emphasis on correctness of results and choice of problem
size. Preliminary results and analysis are reported for eight benchmark codes
on a diverse set of architectures -- three Intel CPUs, five Nvidia GPUs, six
AMD GPUs and a Xeon Phi.
Comment: 10 pages, 5 figures
Associated Legendre Polynomials and Spherical Harmonics Computation for Chemistry Applications
Associated Legendre polynomials and spherical harmonics are central to
calculations in many fields of science and mathematics - not only chemistry but
also computer graphics, magnetics, seismology and geodesy. There are a number of
algorithms for these functions published since 1960 but none of them satisfy
our requirements. In this paper, we present a comprehensive review of
algorithms in the literature and, based on them, propose an efficient and
accurate code for quantum chemistry. Our requirements are to efficiently
calculate these functions for all non-negative integer degrees and orders up to
a given number (<=1000), with the absolute or relative error of each
calculated value not exceeding 10E-10. We achieve this by normalizing the
polynomials, employing efficient and stable recurrence relations, and
precomputing coefficients. The algorithm presented here is straightforward and
may be used in other areas of science.
Comment: The 40th Congress on Science and Technology of Thailand (STT40
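The three ingredients named in the abstract above (normalization, a stable recurrence, and precomputed coefficients) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `normalized_legendre`, the orthonormal convention on [-1, 1], the Condon-Shortley phase, and the table layout are all assumptions made here for illustration, and the sketch makes no claim about meeting the stated error bound at high degree.

```python
import math

def normalized_legendre(lmax, x):
    """Return a table P[l][m] (0 <= m <= l <= lmax) of associated Legendre
    functions, normalized so that the integral of P[l][m]**2 over [-1, 1] is 1.
    Illustrative sketch only; names and conventions are assumptions."""
    n = lmax + 1
    # Precompute recurrence coefficients; they do not depend on x, so in
    # practice they could be computed once and reused for many x values.
    a = [[0.0] * n for _ in range(n)]
    b = [[0.0] * n for _ in range(n)]
    for l in range(2, n):
        for m in range(0, l - 1):  # m = 0 .. l-2
            a[l][m] = math.sqrt((4 * l * l - 1) / (l * l - m * m))
            b[l][m] = math.sqrt(((l - 1) ** 2 - m * m) / (4 * (l - 1) ** 2 - 1))

    s = math.sqrt(1.0 - x * x)
    P = [[0.0] * n for _ in range(n)]
    P[0][0] = math.sqrt(0.5)
    # Diagonal terms P[m][m], including the Condon-Shortley phase (assumed).
    for m in range(1, n):
        P[m][m] = -math.sqrt((2 * m + 1) / (2 * m)) * s * P[m - 1][m - 1]
    # First off-diagonal terms P[m+1][m].
    for m in range(0, n - 1):
        P[m + 1][m] = math.sqrt(2 * m + 3) * x * P[m][m]
    # Remaining terms by the three-term recurrence, increasing in degree l.
    for m in range(0, n - 2):
        for l in range(m + 2, n):
            P[l][m] = a[l][m] * (x * P[l - 1][m] - b[l][m] * P[l - 2][m])
    return P
```

Working with normalized values keeps the magnitudes of intermediate terms moderate compared with the unnormalized polynomials, whose factorial-sized factors grow quickly with degree. For example, `normalized_legendre(2, 0.3)[2][0]` can be compared against the closed form `sqrt(5/2) * (3*x**2 - 1) / 2`.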
Efficient update of ghost regions using active messages
The use of ghost regions is a common feature of many distributed grid applications. A ghost region holds local read-only copies of remotely-held boundary data which are exchanged and cached many times over the course of a computation. X10 is a modern par
PGAS-FMM: Implementing a distributed fast multipole method using the X10 programming language
The fast multipole method (FMM) is a complex, multi-stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchr
AIWC: OpenCL-Based architecture-independent workload characterization
Measuring performance-critical characteristics of application workloads is important both for developers, who must understand and optimize the performance of codes, and for designers and integrators of HPC systems, who must ensure that compute architectures are suitable for the intended workloads. However, if these workload characteristics are tied to architectural features that are specific to a particular system, they may not generalize well to alternative or future systems. An architecture-independent method ensures an accurate characterization of inherent program behaviour, without bias due to architecture-dependent features that vary widely between different types of accelerators. This work presents the first architecture-independent workload characterization framework for heterogeneous compute platforms, proposing a set of metrics determining the suitability and performance of an application on any parallel HPC architecture. The tool, AIWC, is a plugin for the open-source Oclgrind simulator. It supports parallel workloads and is capable of characterizing OpenCL codes currently in use in the supercomputing setting. AIWC simulates an OpenCL device by directly interpreting LLVM instructions, and the resulting metrics may be used for performance prediction and developer feedback to guide device-specific optimizations. An evaluation of the metrics collected over a subset of the Extended OpenDwarfs Benchmark Suite is also presented.